ABSTRACT



This paper addresses the controversy about use of "out-of-level" testing,
the practice of assessing students (usually those with disabilities) with a
lower-level version of a test. The controversy pits unintended instructional
consequences against "accurately" measuring performance and avoiding student
frustration. Introductory sections explain what out-of-level testing is and
offer a brief history of its use. Next, arguments for out-of-level testing
are offered, including avoidance of student frustration and emotional trauma;
improved accuracy of measurement; and better measurement when the context of
the test matches the student's instructional level. Arguments against
out-of-level testing stress that assessments must be consistent with the
purpose for which they are used and that out-of-level testing reflects low
expectations for students and negatively affects their instruction. Next,
five assumptions for out-of-level testing and objections to these assumptions
are listed. Three considerations in using out-of-level testing for individual
students are identified: (1) performance on grade-level assessment is likely
to be spuriously higher than on out-of-level assessments; (2) instructional
issues need to be addressed before students are placed in out-of-level tests;
and (3) unintended consequences of out-of-level testing include never
reaching grade level or passing a high-stakes test. Finally, questions for
decision makers to consider before using out-of-level tests are suggested.
(DB)






NCEO POLICY DIRECTIONS 






NUMBER 9 
APRIL 1999 

OUT-OF-LEVEL TESTING: PROS AND CONS 











Background 



Whether called "out-of-level," "off-grade-level," "functioning-level," or
"instructional-level" testing, the practice of assessing students using a
lower-level version of a test is controversial. The controversy pits
unintended instructional consequences against "accurately" measuring
performance and avoiding student frustration. The controversy also reflects
beliefs about the appropriateness of delivering instruction at a student's
perceived functional level rather than adapting on-grade-level instruction
to the specific needs of the student. The out-of-level testing controversy
is particularly pertinent to students with disabilities, who typically are
functioning at lower performance levels than their peers, and who, as a
result of changes in federal education laws, must participate in state and
district assessments.



To explore the controversy of out-of-level testing, and assist educators and
policymakers in making appropriate decisions about its use in large-scale
assessments, we describe its meaning and its history, then discuss arguments
for and against its use. We conclude with several important considerations
and questions to ask before implementing an out-of-level testing policy or
administering an out-of-level test to a student.




What is Out-of-Level Testing?



Out-of-level testing is a term used to mean that a student who is in one
grade is assessed using a level of a test that was developed for students in
another grade. Lower-level testing is almost universally what is meant when
terms like "out-of-level," "off-grade level," and "instructional-level" are
used.



The use of out-of-level testing in large-scale assessment programs has
increased during the past 10 years. Generally it is presented in policy as
an accommodation or modification for students with disabilities. (See
Table 1 for trends in the use of out-of-level testing.) State policies often
warn that scores from out-of-level testing are to be interpreted with
caution; usually, out-of-level tests can be used only for students with
disabilities.

History of Out-of-Level Testing



Out-of-level testing first emerged in norm-referenced testing.
Norm-referenced tests (NRTs) were developed with forms for different grade
levels. Originally, it was intended that a child would be given the form
that corresponded to that child's grade level.



Following procedures used in individualized testing, it was sometimes
decided to use the same procedures for group testing: to select for a
student the form that corresponded to that student's functional skill level.
The decision about which form to use was based on other information about
the student, such as assessed reading level, teacher judgment of
instructional level, and so on.



These approaches reflect some of the same ideas as those that are used in
individualized intelligence and achievement testing. They are also the basis
for computer-adapted testing in which performance on selected test items
leads to a branch of items that start with those the student can answer
correctly, regardless of the difficulty levels of the items. In fact,
out-of-level testing has been called the "poor man's version of
computer-adapted testing." Although out-of-level testing grew out of
norm-referenced testing, it soon was being applied to criterion-referenced
tests (CRTs).

Table 1. Trends in the Use of Out-of-Level Testing

States Allowing Out-of-Level Testing

1993 (a)      1995 (b)            1997 (c)
--------      ----------------    ----------------
Georgia       Connecticut         Alaska
              Georgia             Connecticut
              Kansas              Georgia
              North Carolina      Maine
              Oregon              Missouri
                                  New Hampshire
                                  New York
                                  North Dakota
                                  Vermont
                                  West Virginia

(a) From Thurlow, Ysseldyke, & Silverstein (1993).
(b) From Thurlow, Scott, & Ysseldyke (1995).
(c) From Roeber, Bond, & Connealy (1998).


Perceived Pros and Cons of Out-of-Level Testing

There are both pros and cons associated with out-of-level testing. They
reflect different perspectives on large-scale testing and the connection
between instruction and tests developed to assess the results of
instruction.

► Arguments For Out-of-Level Testing

Individuals who argue the pro side of out-of-level testing generally cite
three types of benefits: (1) avoiding undue frustration for the student,
(2) improving the accuracy of measurement, and (3) better matching the
student's current educational goals and instructional level. It is suggested
that it is unfair for students who are not performing at grade-level to be
subjected to grade-level tests.

Avoiding student frustration and emotional trauma are common arguments for
out-of-level testing. Being tested at grade-level when not performing at
this level is considered to be too emotionally traumatizing, and
traumatization from the testing experience is thought to increase
exponentially as the difference increases between the student's grade and
the grade at which the student is functioning. Those in favor of
out-of-level testing also argue that out-of-level testing actually is the
most humane approach for students not performing well in school. Students
are not forced to dwell on their errors, but rather are provided with test
items to which they can respond in a reasonable manner.

Improved accuracy of measurement is also given as a reason for out-of-level
testing. Psychometric support for out-of-level testing cites the
over-statement of actual performance that occurs when there are many
chance-level scores for students assessed at their grade level (Doscher &
Bruno, 1981; Wick, 1983). This means that the performance of students looks
better than it actually is when grade-level assessments are used.

Better measurement occurs when the context of the test matches the student's
instructional level. It is generally recognized that the focus of
out-of-level testing may not be the same as the grade-level goals. Still,
out-of-level tests are said to accurately measure the student's intermediate
goals on the pathway to the grade-level standards.

► Arguments Against Out-of-Level Testing

Individuals who argue against the use of out-of-level testing generally
focus on the purpose of assessments and concerns about expectations and
instruction for students. In addition, there are specific responses to some
of the arguments made by those supporting the use of out-of-level testing
(see Table 2).

Assessments must be consistent with the purpose for which they are being
used. Although out-of-level testing may be appropriate for making
instructional decisions (e.g., knowing what skills the student has now so
that plans can be made about what to teach next), it is viewed as
inappropriate for accountability assessments. State and district assessments
almost always are used for accountability purposes: to describe what
students know and can do in relation to a set of standards and to evaluate
how schools and programs are progressing in providing students with desired
knowledge and skills. Testing at a lower grade level does not reflect the
student's performance at the standard being assessed for the majority of
students.

Out-of-level testing reflects low expectations for students and negatively
affects their instruction. Too often, expectations for students who have not
performed well in the past are below what they should be, creating a
never-ending cycle of low expectations resulting in lower performance, which
in turn results in even lower expectations. There are many instances of
teachers being surprised by how well students performed when they were
tested at grade level. There are related concerns about what happens in
instruction when out-of-level approaches are used. It may be assumed that
what the student is being tested on is all that the student needs to learn,
with the resulting instruction focusing on lower-level standards than those
toward which the student should be striving.

Assumptions of Out-of-Level Testing

There are five assumptions that test developers say should be met before
out-of-level testing is considered an appropriate adaptation of testing.
There are also objections to the appropriateness of each of the assumptions
(see Table 3).



Considerations in Using Out-of-Level Testing

Three considerations derived from research are important when thinking about
the use of out-of-level testing either for a system or for individual
students:

Performance on grade-level assessments is likely to be spuriously higher
than on out-of-level assessments. Doscher and Bruno (1981) identified this
trend in a simulation of test performance, noting that "results show test
scores to be overstatements of subject mastery, with larger distortions at
the lower achievement levels" (p. 475). Wick (1983) confirmed this tendency
using actual standardized test results from the Chicago Public Schools,
where frequent occurrences of chance scores produced overstatements of
actual performance. Wick produced data showing that a move from on-grade
testing to functional testing resulted in lower scores, with the negative
impact increasing as the students' grade-level increased.
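
The direction of this distortion can be illustrated with a short
calculation. The sketch below is not the simulation reported by Doscher and
Bruno (1981) or the analysis by Wick (1983). It assumes a multiple-choice
test on which a student answers every mastered item correctly and guesses at
random on the rest; the function names, the 40-item length, the four answer
choices, and the mastery levels are all hypothetical.

```python
def expected_percent_correct(mastery, n_items=40, n_choices=4):
    """Expected percent correct when unmastered items are guessed at random.

    mastery is the assumed fraction of items the student truly knows (0 to 1).
    """
    known = mastery * n_items
    # Each unmastered item has a 1-in-n_choices chance of a lucky guess.
    guessed_right = (n_items - known) / n_choices
    return 100 * (known + guessed_right) / n_items


def overstatement(mastery, n_items=40, n_choices=4):
    """Percentage points by which the expected score exceeds true mastery."""
    return expected_percent_correct(mastery, n_items, n_choices) - 100 * mastery


# Hypothetical mastery levels, from low-achieving to high-achieving.
for mastery in (0.1, 0.5, 0.9):
    print(mastery, expected_percent_correct(mastery), overstatement(mastery))
```

Under these assumptions the gap between the expected score and true mastery
shrinks as mastery rises, which is the pattern the quotation describes:
guessing inflates scores most for students at the lowest achievement levels.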

Instructional issues need to be addressed before students are placed in
out-of-level tests. Too often, assessments are seen as entities in
themselves, unrelated to the instruction that is to be reflected in test
performance. Discussions of out-of-level testing must return to discussions
about out-of-level teaching. For some time, we have known that assessments
are linked in varying degrees to the curricula that they reflect (see
footnote 1). A disconnect between the two at any point can lead to questions
about the validity of the test performance.


Table 2. Arguments For and Against the Use of Out-of-Level Testing

Pro argument: Avoids student frustration and emotional trauma. This is the
humane approach for students not performing well in school.
Rebuttal: Can instruction that does not address needed grade-level material
be thought of as humane? Trauma will be a non-issue if instruction is
consistent with the difficulty of the assessment.

Pro argument: Improves accuracy of measurement.
Rebuttal: How can a test that does not address "grade-level" materials be
more or less accurate than chance scores?

Pro argument: Better matches the student's current educational goals and
instructional level.
Rebuttal: Is it honest for a test to conform to where it is perceived a
child is rather than match what we want the child to know and be able to do?


Table 3. Assumptions For Out-of-Level Testing

Assumption 1: The performance of any small subgroup (such as students with
disabilities) will not have a significant effect on test statistics.
Although it is desirable to include a wide range of students during test
development, it is not necessary that small subgroups be included.
Objection: Reporting the results for students with disabilities is required
by law. To be able to report accurately on the performance of what may seem
to be a small group of students, their inclusion during test development is
critical.

Assumption 2: The test has levels, with each successive level reflecting
more difficult content on the same scale. The levels of a test typically
correspond to grade levels, and the performance of students in the targeted
grade, and in nearby grades, can be represented as a distribution of scores
reflecting increasing difficulty. For out-of-level testing, different levels
must be on the same scale, which is most easily achieved when the test has
an Item Response Theory (IRT) basis (see footnote 2).
Objection: Levels-based tests reflect different content as well as more
difficult content. Creating test levels that are on the same scale of
difficulty does not adjust for the different content that may be included in
different levels. Assuming that the content is the same can lead to
erroneous conclusions (e.g., a grade 3 scale score used as if it were a
grade 5 scale score might imply that a student has some mastery of long
division when the grade 3 test does not include any items on long division).

Assumption 3: The student is in the same scope and sequence as other
students in the same grade. This means that the student is working on the
same academic content as the focus of the assessment (same standards),
although it may be at a much lower level. For example, out-of-level testing
on a reading test can be considered for a student who is learning to read,
even if at a lower level than other students in the same grade, but not for
a student who is learning feeding or other self-care skills.
Objection: The delineation of what is in the same scope and sequence is not
clear. If reading is interpreted broadly, a student who is learning to
recognize letters (pre-reading skill) could be considered to be in the same
scope and sequence as students who are learning to read narrative writing.
Typical state and district tests do not cover such a broad range of skills.
Furthermore, having broad skills does not fix the problem of different
content.

Assumption 4: Out-of-level testing is appropriate for system, not student,
accountability. When the focus of the assessment is system accountability,
then out-of-level testing is appropriate because it provides the best
estimate of all students' skills. It is not appropriate for student
accountability because deciding that a student needs a lower level test is a
declaration that the student does not have mastery before the student takes
the test.
Objection: Out-of-level testing is also inappropriate for system
accountability. Because out-of-level testing does not really tap what the
curriculum is for those students tested out of level, it is not appropriate
to judge the system using a test that does not address what the student
should know and be able to do.

Assumption 5: Scores from out-of-level testing must be transformed to scale
scores. It is inappropriate to assume that a raw score from an out-of-level
test (or a percentile rank based on a raw score) can be reported along with
scores from on-grade level assessments.
Objection: Even transformed scores do not necessarily mean the same thing.
Using converted scale scores may be just as inappropriate because the
out-of-level test reflects different content also.


When decision makers are considering the use of out-of-level testing for a
particular student, their thoughts should immediately turn to the
appropriateness of instruction, and its link to the assessment. Thinking
only about the assessment allows one to ignore the critical element: the
student's instruction. Decision makers must be able to justify the decision
and at the same time be able to defend the instruction that should provide
the basis for on-level test performance.

Unintended consequences of out-of-level testing include never reaching grade
level and never passing a high-stakes test. The use of out-of-level testing
creates unintended consequences beyond simply having low expectations for a
student. When the assessment for which out-of-level testing is being
considered is one that is high stakes for the student, such as a graduation
exam, the use of out-of-level testing essentially prevents the student from
passing the high-stakes assessment. Unless specific procedures are in place
for immediately moving a student into a grade-level assessment when the
student does well on an out-of-level assessment, the student probably will
never reach grade level or pass a test that determines promotion or
graduation.

Questions to Ask Before Using Out-of-Level Tests

Because there are both pros and cons to the use of out-of-level testing, it
is extremely important to make good decisions about whether out-of-level
testing is appropriate for an individual student, given the purpose of the
assessment. Likewise, it is important for testing programs to consider the
potential consequences of providing out-of-level testing as an option that
may be selected for individual students. There are several questions that
should be asked to help decision makers formulate good decisions.

What is the purpose of the assessment? If the purpose of the assessment is
student accountability, then the use of out-of-level testing may make it
essentially impossible for those students given out-of-level tests to pass
the tests. This is because the lower level tests do not allow the student to
demonstrate mastery at the required difficulty level. If the purpose of the
assessment is system accountability, and there is no need to report on the
performance of any subgroup of students that might have a significant
proportion taking out-of-level assessments, then out-of-level assessments
are possibly appropriate. If there is a need to be able to report on a
subgroup of students likely to be put into out-of-level assessments, such as
students with disabilities, then the use of out-of-level tests creates a
problem for being able to report accurate scores.

Was the test designed to have different levels that are appropriately
connected? Most large-scale assessments used by districts or states, unless
they are norm-referenced assessments, were not designed to have different
levels that correspond to different grades or groups of grades. Still, tests
developed via Item Response Theory (IRT) have the potential to provide the
information needed to form a common scale across disparate grade levels.
Those considering out-of-level testing must make sure that a common scale is
available across levels to have a psychometric justification for the use of
out-of-level testing.

Are the unintended consequences of out-of-level testing appropriate? If the
assessment is used to drive changes in instructional practices, then serious
questions must be raised about the use of out-of-level assessments. If a
test is directed to anything less than what you want the student to know and
be able to do, the danger of reinforcing inappropriate instruction is
considerable. On the other hand, if a test is used to determine what skills
to teach a student next, then out-of-level testing may be appropriate, as
long as the end goal, the standard, is still in sight.




Summary 



While there are times when out-of-level testing may be appropriate, there
are many times when it is not. Careful consideration of the assumptions
underlying out-of-level testing, the purpose of the assessment and its
characteristics, and the potential consequences of using out-of-level
testing is advised for any program or individual decision-making team
contemplating the use of out-of-level testing.

Footnotes



1 Research indicating that performance on assessments is linked to curricula
includes that of Shriner and Salvia (1988), who demonstrated that
performance on mathematics tests varied as a function of the curriculum that
was the basis for the student's instruction, and Bielinski and Davison
(1998), who showed that differences in the construction of mathematics items
can account for differences in the performance of males and females on the
SAT.

[Shriner, J., & Salvia, J. (1988). Content validity of two tests with two
math curricula over three years: Another instance of chronic
noncorrespondence. Exceptional Children, 55, 240-248.]

[Bielinski, J., & Davison, M. L. (1998). Gender differences by item
difficulty interactions in multiple-choice mathematics items. American
Educational Research Journal, 35(3), 455-476.]



2 Item Response Theory is one approach to constructing tests. It is based on
the characteristics of individual test items. To create common scale scores,
different levels of the test are administered to the same students. For
example, a state with tests in grades 3 and 5 might link the two tests by
administering both to a sample of grade 4 students. Raw scores on the two
tests are linked to form a common scale score. In this way, for example, it
is determined that a raw score of 35 on the grade 3 level test is
approximately equivalent to a raw score of 20 on the grade 5 level test.
Both of these are translated to a scale score of, say, 350.
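
The footnote's example can be written out as a small lookup, shown here only
as a sketch. The table name, the function name, and the table layout are
hypothetical. The two entries are the illustrative values from the footnote;
an operational conversion table would cover every raw score on each level
and would come from an actual linking study.

```python
# Hypothetical raw-to-scale conversion table. Only these two entries appear
# in the footnote; a real table would list every raw score for each level.
RAW_TO_SCALE = {
    ("grade 3", 35): 350,
    ("grade 5", 20): 350,
}


def to_scale_score(level, raw_score):
    """Return the common scale score for a raw score on a given test level."""
    key = (level, raw_score)
    if key not in RAW_TO_SCALE:
        # Refuse to guess: a raw score without a linked value is not reportable.
        raise ValueError(f"no linked scale score for {level}, raw score {raw_score}")
    return RAW_TO_SCALE[key]


# Different raw scores on different levels map to the same scale score.
print(to_scale_score("grade 3", 35), to_scale_score("grade 5", 20))
```

The lookup shows why raw scores from different levels cannot be reported
side by side: 35 and 20 look very different but stand for the same point on
the common scale. It does not address the objection in Table 3 that equal
scale scores may still reflect different content.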